Although the origins of the Internet trace back to the late 1960s, the more recently developed World Wide Web (“Web”), together with the long-established Usenet, has revolutionized access by a worldwide audience to untold volumes of information stored in electronic form, including written, spoken (audio) and visual (imagery and video) information, in both archived and real-time formats. The Web provides information via interconnected Web pages that can be navigated through embedded hyperlinks. In short, the Web provides desktop access to a virtually unlimited library of information in almost every language.
Web content is relatively unstructured in terms of grammar and standardized usage. Web content is often presented in the form of excerpts, which are primarily short, self-contained narratives including one or more headlines and accompanying text. Excerpts might occur as an artifact of the graphical nature of the Web, which emphasizes the tabular presentation of information. In addition, grammatical rules are often ignored in Web content, which can be typified by incomplete sentences, improper capitalization and often bad prose.
In an attempt to improve the presentation and quality of Web content, many Web content publishers, particularly those who publish Web content submissions received from third parties, have implemented editorial guidelines, which provide a set of rules for acceptable style and grammar. Editorial guidelines strive to provide improved appearance and uniformity, but may not necessarily attempt to enforce correct grammar. Editorial guidelines often function as a precondition to Web content publication, and compliance can be difficult if a Web content submission is created through automated means. Moreover, compliance is particularly problematic for users who have a significant body of contributions, such as a Web retailer with a large product catalog that would be difficult to fully evaluate for editorial guideline compliance.
Third-party advertisers, in particular, can be at odds with editorial guidelines, yet can benefit by advertising on-line. Compliance is important to such advertisers because the Web provides a vehicle to reach a potentially large audience inexpensively. Advertisements can be provided with existing Web content, such as in conjunction with on-line news and information. Advertisements can also be tied to results generated by search engines to build on the topical nature of the underlying query.
Web-based advertisements also tend to be unstructured and often contain only nouns, adjectives, conjunctions, and prepositions with little or no punctuation. Improper capitalization often occurs in the product or service name description. However, improper capitalization can render an advertisement ineligible for display by a Web content publisher that enforces correct capitalization and similar grammatical conventions.
Conventional approaches to ensuring compliance with editorial guidelines and similar requirements often employ manual or rote correction of word capitalization. However, such approaches can be slow and expensive. Moreover, blanket capitalization correction can overcompensate by removing non-standard and “unusual” forms of acceptable capitalization, such as those found in certain proper nouns. For instance, “PlayStation” is a properly capitalized registered trademark that rote correction would alter. Blanket capitalization correction can be particularly impractical for a large number of product or service advertisements.
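The overcompensation described above can be illustrated with a minimal sketch. The function names and the small exception table below are hypothetical and chosen only for illustration; they do not represent any particular publisher's tooling or the approach described elsewhere in this document.

```python
def blanket_title_case(text: str) -> str:
    """Rote correction: uppercase the first letter of each word, lowercase the rest."""
    return " ".join(word.capitalize() for word in text.split())


# Hypothetical table of acceptable non-standard capitalizations,
# keyed by the lowercased form of the word.
KNOWN_FORMS = {"playstation": "PlayStation"}


def exception_aware_title_case(text: str, known_forms: dict = KNOWN_FORMS) -> str:
    """Same rote correction, except that listed forms are preserved or restored."""
    return " ".join(
        known_forms.get(word.lower(), word.capitalize()) for word in text.split()
    )


ad_text = "new PlayStation games"
print(blanket_title_case(ad_text))          # New Playstation Games
print(exception_aware_title_case(ad_text))  # New PlayStation Games
```

The rote version destroys the internal capital in the trademark, whereas even a small exception table preserves it. Maintaining such a table by hand, however, is itself impractical for a large catalog of advertisements.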
Therefore, there is a need to improve capitalization correction of words identified in excerpts from, for instance, Web content. Preferably, such an approach would enforce grammatical and editorial guideline conventions and would accommodate frequently occurring yet non-standard capitalization variations.
There is a further need for generating a lexicon containing capitalization variations for use in capitalization correction. Preferably, such generation should facilitate grammatical and editorial guideline compliance.